[native] Add Arrow Flight Connector#24082
Conversation
|
This PR requires a velox support PR facebookincubator/velox#11588, to install Arrow Flight dependencies. |
|
Saved that user @Rijin-N is from IBM |
presto-native-execution/Makefile
Outdated
| CMAKE_FLAGS += -DENABLE_ALL_WARNINGS=${ENABLE_WALL} | ||
| CMAKE_FLAGS += -DCMAKE_PREFIX_PATH=$(CMAKE_PREFIX_PATH) | ||
| CMAKE_FLAGS += -DCMAKE_BUILD_TYPE=$(BUILD_TYPE) | ||
| CMAKE_FLAGS += -DPRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR=$(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR) |
There was a problem hiding this comment.
We should stop adding these and instead use EXTRA_CMAKE_FLAGS externally when invoking make
presto-native-execution/presto_cpp/main/connectors/arrow_flight/ArrowFlightConnector.cpp
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/arrow_flight/ArrowFlightConnector.cpp
Outdated
Show resolved
Hide resolved
|
|
||
| CallOptionsAddHeaders callOptsAddHeaders{}; | ||
| FlightCallOptions& callOpts = callOptsAddHeaders; | ||
| AddCallHeaders& headerWriter = callOptsAddHeaders; |
There was a problem hiding this comment.
You can just add headers directly to the FlightCallOptions, you don't need to use the AddCallHeaders interface
There was a problem hiding this comment.
The idea here was to have an append-only interface for adding headers to the FlightCallOptions.
Without this, the Authenticator interface would need to accept a mutable reference to FlightCallOptions instead, which theoretically allows the authenticator to modify other parts of the FlightCallOptions, which is not desirable.
It's also less confusing when creating Authenticator implementations, since you don't need to worry about which parts of the FlightCallOptions should or should not be changed.
presto-native-execution/presto_cpp/main/connectors/arrow_flight/ArrowFlightConnector.cpp
Outdated
Show resolved
Hide resolved
aditi-pandit
left a comment
There was a problem hiding this comment.
Some common themes
i) Please add a copyright header to all the new files added in this PR.
ii) Please follow the test case naming as per conventions in https://github.com/facebookincubator/velox/blob/main/CODING_STYLE.md#naming-conventions
| set_property(TARGET presto_server_lib PROPERTY JOB_POOL_LINK | ||
| presto_link_job_pool) | ||
|
|
||
| if(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR) |
There was a problem hiding this comment.
These lines should be in presto_cpp/main/connectors/arrow_flight/CMakeLists.txt
| std::make_unique<IcebergPrestoToVeloxConnector>("iceberg")); | ||
| registerPrestoToVeloxConnector( | ||
| std::make_unique<TpchPrestoToVeloxConnector>("tpch")); | ||
| #ifdef PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR |
There was a problem hiding this comment.
The previous entries are for the base Velox connectors that we will mostly add by default.
Like in other parts of this code, we should add this registration in ArrowFlightConnector.cpp/.hpp if possible.
Another option is to add a ConnectorRegistration.h(cpp) file in presto_cpp/main/connectors folder that includes all the registrations of connectors in its sub-directories and only call that function from here.
e.g say for Windowfunctions in prestosql there is https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/window/WindowFunctionsRegistration.h that has all individual function registrations.
| return std::make_unique<protocol::tpch::TpchConnectorProtocol>(); | ||
| } | ||
|
|
||
| #ifdef PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR |
There was a problem hiding this comment.
This logic should be moved to a file in presto_cpp/main/arrow_flight (ArrowPrestoToVeloxConnector.h/.cpp). The registration logic should be part of ConnectorRegistration file proposed.
| for (const auto& protocolSplit : source.splits) { | ||
| auto split = toVeloxSplit(protocolSplit); | ||
| if (split.hasConnectorSplit()) { | ||
| // Since the extra credential data is coming from TaskUpdateRequest |
There was a problem hiding this comment.
There shouldn't be such special case code in TaskManager.cpp.
The more appropriate way to do this is to add the extraCredentials as a parameter in the toVeloxSplit logic and move this specific logic over to the ArrowToPrestoVeloxConnector::toVeloxSplit method.
|
|
||
| std::unique_ptr<StaticFlightServer> StaticFlightServerTest::server; | ||
|
|
||
| TEST_F(StaticFlightServerTest, ServerTest) { |
There was a problem hiding this comment.
Test names start with lower case. Maybe you can call this test "basic" inline with other tests in the code-base.
| std::unique_ptr<StaticFlightServer> StaticFlightServerTest::server; | ||
|
|
||
| TEST_F(StaticFlightServerTest, ServerTest) { | ||
| auto sampleTable = makeArrowTable( |
There was a problem hiding this comment.
Its important to add other types as well... Would be great to demo all the scalar and complex types that are supported by this API.
There was a problem hiding this comment.
The StaticFlightServer is just a thin wrapper around the basic Flight Server provided by the Arrow Flight SDK. Testing this server against specific datatypes is something better reserved for the unit tests in the Arrow C++ SDK. The test here is just to make sure that our wrapper is working and we're able to perform simple operations.
Note that the StaticFlightServer isn't actually used in the Flight connector code - it is only used in other unit tests for validating the Flight connector against a "real" server.
That said, we do test for all the different types supported by the Flight connector in ArrowFlightConnectorDataTypeTest.cpp. These test the full flow of data from the StaticFlightServer to the connector and out. We're testing with ints, reals, string types, date and time types etc.
| {FlightConfig::kServerVerify, "true"}})) {} | ||
| }; | ||
|
|
||
| TEST_F(FlightConnectorTlsNoCertTest, TlsNoCertTest) { |
There was a problem hiding this comment.
Change test names to start with lowercase letters.
| FlightConnectorTlsNoCertTest() | ||
| : FlightConnectorTlsTestBase(std::make_shared<velox::config::ConfigBase>( | ||
| std::unordered_map<std::string, std::string>{ | ||
| {FlightConfig::kServerVerify, "true"}})) {} |
There was a problem hiding this comment.
Seems like this is always a bad config ? Can we error this at the outset itself ?
There was a problem hiding this comment.
This isn't necessarily a bad config. We could be dealing with servers which use certificates signed by trusted CAs, making additional certificates unnecessary.
In our case, this is a purely negative test because our test server is set up with a self-signed certificate, and it's impractical to acquire a certificate signed by a trusted CA just for this purpose.
| .assertResults(makeRowVector({decimalVecBigInt})); | ||
| } | ||
|
|
||
| TEST_F(FlightConnectorDataTypeTest, GeneralTest) { |
e51be47 to
9f48cec
Compare
The native Arrow Flight connector can be used to connect to any Arrow Flight enabled Data Source. The metadata layer is handled by the Presto coordinator and does not need to be re-implemented in C++. Any Java connector that inherits from `presto-base-arrow-flight` can use this connector as it's counterpart for the Prestissimo layer. Different Arrow-Flight enabled data sources can differ in authentication styles. A plugin-style interface is provided to handle such cases with custom authentication code. RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0004-arrow-flight-connector.md#prestissimo-implementation Co-authored-by: Ashwin Kumar <Ashwin.Kumar6@ibm.com> Co-authored-by: Rijin-N <rijin.n@ibm.com> Co-authored-by: Nischay Yadav <Nischay.Yadav@ibm.com>
9f48cec to
bd02070
Compare
| } | ||
|
|
||
| RowVectorPtr FlightDataSource::projectOutputColumns( | ||
| std::shared_ptr<arrow::RecordBatch> input) { |
| using namespace velox; | ||
| using namespace velox::connector; | ||
|
|
||
| // wrapper for CallOptions which doesn't add any members variables |
There was a problem hiding this comment.
Nit: when leaving comments make sure to fllow the comment guidelines
- must be complete sentence (period at end and usually capitalization)
etc
This occurs multiple times in a few files.
| uint64_t completedRows_ = 0; | ||
| uint64_t completedBytes_ = 0; | ||
| std::shared_ptr<auth::Authenticator> authenticator_; | ||
| velox::memory::MemoryPool* pool_; |
There was a problem hiding this comment.
Nit: can be velox::memory::MemoryPool* const pool_;
| explicit ArrowFlightConnectorFactory(const char* name) | ||
| : ConnectorFactory{name} {} | ||
|
|
||
| ArrowFlightConnectorFactory(const char* name, const char* authenticatorName) |
There was a problem hiding this comment.
Why not use std::string_view instead of const char* for the authenticatorName?
Why do we need this constructor in the first place? It doesn't seem to be used at all.
The authenticator comes from the FlightConfig. Assuming there is a different Authenticator implementation why would it not be named/configured in the FlightConfig and instead be hard coded at registration time?
There was a problem hiding this comment.
Why do we need this constructor in the first place?
The idea here is to be able to register different connector names which use the same Flight connector code, but use different Authenticators. This would avoid the need to have two separate config properties to fully qualify the constructor; instead of specifying connector.name and arrow-flight.authenticator.name, you simply specify connector.name and are good to go.
This would tie in a little better with the Java version, where authentication needs to be handled by subclasses of presto-base-arrow-flight, each of which are independent connectors.
Also note that the config arrow-flight.authenticator.name is not present in the base Java Flight connector. A concrete connector extension would need to introduce this config explicitly to be able to switch between between C++ Authenticators within the same connector.
https://github.com/prestodb/presto/pull/23032/files#diff-9b8ea7947728279eacfeb69fc0492cb2390c3dd65897bc9148ecf1d021d37332
| target_link_libraries( | ||
| presto_flight_connector | ||
| velox_connector | ||
| arrow |
There was a problem hiding this comment.
What is arrow referring to here? It doesn't look like a target but something that cam from PkgConfig? How does this differ to ${ArrowFlight_LIBRARIES}?
There was a problem hiding this comment.
Done. Removed arrow, Actually ${ArrowFlight_LIBRARIES} are enough here.
|
|
||
| #define AFC_REGISTER_AUTH_FACTORY(factory) \ | ||
| namespace { \ | ||
| static bool FB_ANONYMOUS_VARIABLE(g_ConnectorFactory) = ::facebook::presto:: \ |
There was a problem hiding this comment.
I'm not understanding the reason for keeping this static bool around - to indicate a successful registration of NoOpAuthenticatorFactory. Once the factory is registered it is in the list. What's the point of having the static bool of the fact that it was registered stored?
The check that the registration was successful was already done (via VELOX_CHECK). Nothing is checking this bool again.
There was a problem hiding this comment.
This variable is not meant to be used anywhere, which is why it's marked as an FB_ANONYMOUS_VARIABLE. This only reason this variable exists is to ensure that it is statically initialized at runtime, and thus ensure that the Authenticator in question is statically registered.
If C++ had static blocks similar to Java, we could have used that instead.
| : Connector{id}, | ||
| flightConfig_{std::make_shared<FlightConfig>(config)}, | ||
| authenticator_{ | ||
| // auth::getAuthenticatorFactory(config_->authenticatorName()) |
| .assertResults(makeRowVector({boolVec})); | ||
| } | ||
|
|
||
| TEST_F(FlightConnectorDataTypeTest, IntType) { |
There was a problem hiding this comment.
better name: integerTypes.
Please see the other comment on naming for the tests (lowerCase) and maybe fix some of the names along the way.
| protected: | ||
| explicit FlightConnectorTestBase( | ||
| std::shared_ptr<velox::config::ConfigBase> config) | ||
| : config_{config} {} |
| install_abseil | ||
| install_grpc | ||
|
|
||
| wget_and_untar https://archive.apache.org/dist/arrow/arrow-${ARROW_VERSION}/apache-arrow-${ARROW_VERSION}.tar.gz arrow |
There was a problem hiding this comment.
Lets use the source from github https://github.com/apache/arrow/archive/apache-arrow-15.0.0.tar.gz. apache.org can have issues with downloads.
steveburnett
left a comment
There was a problem hiding this comment.
I do not find any documentation for me to review.
Should "Arrow Flight connector" be added to the Supported Use Cases topic in the Presto C++ documentation?
|
Hi @majetideepak @BryanCutler @aditi-pandit @czentgr , We have addressed or responded with our reasoning to all the comments you provided. Could you please review and let us know if there is anything else that needs to be done. Thanks. Note: This format checker is working fine locally for us. We're not sure why it is failing in this PR. |
steveburnett
left a comment
There was a problem hiding this comment.
Thank you for the doc! Some suggestions for the data/README.md text.
presto-native-execution/presto_cpp/main/connectors/arrow_flight/tests/data/README.md
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/arrow_flight/tests/data/README.md
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/arrow_flight/tests/data/README.md
Outdated
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/connectors/arrow_flight/tests/data/README.md
Outdated
Show resolved
Hide resolved
|
|
||
| std::optional<RowVectorPtr> FlightDataSource::next( | ||
| uint64_t size, | ||
| velox::ContinueFuture& future) { |
There was a problem hiding this comment.
Here and in the declaration comment out the unused variable name with new name unused to only leave the type signature.
/* unused */
Doesn't seem like it is used so lets indicate it.
| explicit ArrowFlightConnectorFactory( | ||
| const char* name, | ||
| const char* authenticatorName = nullptr) | ||
| : ConnectorFactory{name}, authenticatorName_{authenticatorName} {} |
There was a problem hiding this comment.
Nit: Lets use () instead of {} for the initializer list for consistency.
|
Hi @Rijin-N @ashkrisk based on the discussion we had with Shweta and Sandhya, we are gonna take over this work, @BryanCutler will help move this PR over the finishing line. We keep everyone who worked on this PR as co-authors. Thank you. cc @tdcmeehan |
|
Hi @ethanyzhang, We have addressed all the comments so far. You can take over from here. |
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Used "View file" in GitHub to view the .md files as they will be published. Looks good, thank you!
|
Superseded by #24504 |
Description
Add Prestissimo support for Arrow Flight connectors.
The native Arrow Flight connector can be used to connect to any Arrow Flight enabled Data Source. The metadata layer is handled by the Presto coordinator and does not need to be re-implemented in C++. Any Java connector that inherits from
presto-base-arrow-flightcan use this connector as it's counterpart for the Prestissimo layer.Different Arrow-Flight enabled data sources can differ in authentication styles. A plugin-style interface is provided to handle such cases with custom authentication code.
Motivation and Context
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0004-arrow-flight-connector.md#prestissimo-implementation
Impact
Arrow Flight based connectors can be used with Prestissimo
Test Plan
Unit Tests set up a mock arrow flight server and exchange data using the new connector.
Contributor checklist
a GitHub Issuean RFC is referenced.Release Notes
Please follow release notes guidelines and fill in the release notes below.